DBpedia Spotlight
Introduction
DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. “Michael Jordan”), and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine learning professor, or dbpedia:Michael_Jordan, the basketball player). It can also be used as a building block for your own solutions for Named Entity Recognition, Keyphrase Extraction, Tagging and other information extraction tasks.
Text annotation has the potential of enhancing a wide range of applications, including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.
Take a look at our Known Uses page for other examples of how DBpedia Spotlight can be used. If you use DBpedia Spotlight in your project, please add a link to http://spotlight.dbpedia.org. If you use it in a paper, please use the citation available here.
You can try out DBpedia Spotlight through our Web Application or Web Service endpoints. The Web Application is a user interface that allows you to enter text in a form and generates an HTML annotated version of the text with links to DBpedia. The Web Service endpoints provide programmatic access to the demo and can also return data as XML or JSON.
Glossary
Context: the context refers to “the parts of something written or spoken that immediately precede and follow a word or passage and clarify its meaning.”
OntologyClass: an ontology class represents a set of resources sharing similar characteristics. Resources can be of several types: Person, Organisation, Location, FloweringPlant, etc. All of these classes are organized in a domain model (i.e. schema, ontology). The “type” or the “ontology class” of a resource comes from this ontology.
Phrase Recognition: See Spotting.
Resource: a resource is any entity or concept in our target knowledge base (e.g. DBpedia). We take this name from RDF (Resource Description Framework), as a generic name for things, concepts, ideas “that can be identified on the Web, even when they cannot be directly retrieved on the Web.”
Spotting: We call Spotting or Phrase Recognition the task of selecting, from some textual document given as input, phrases that should be annotated by the system. This is closely related to Keyphrase Extraction and Named Entity Recognition, for instance. In Keyphrase Extraction, the system tries to guess the “important” phrases, according to some definition of importance. Meanwhile, in Named Entity Recognition, the system focuses on specific entity types (commonly Person, Location and Organization), and the notion of importance is usually irrelevant. We describe some of these and several other strategies for phrase recognition below.
SurfaceForm: a surface form is the phrase used to refer to a resource in text. For example: “Barack Obama”, “President Obama” and “Obama” are all surface forms for the resource dbpedia:Obama.
Token: each individual element extracted after tokenizing the text. Tokens are the individual words in the context, or slightly modified versions of these words (e.g. running -> run).
Topic: a topic is a broad categorization of knowledge into areas of interest. For example, text can belong to Business, Politics, Sports or Arts topics.
User’s manual
DBpedia Spotlight is a tool for annotating mentions of DBpedia concepts in plain text.
We offer three basic functions: Annotate, Disambiguate and Candidates (Best K). They can be accessed from a Scala/Java API, a REST Web Service and a user interface on the Web (HTML/Javascript). For the Scala/Java API, there are a number of configuration parameters that can be used to instruct the annotation and disambiguation functions. The classes DefaultAnnotator, DefaultDisambiguator and DefaultParagraphDisambiguator offer the configuration that we found to provide the best results. The configuration interface offers ways to control the quality of the output of the annotation and disambiguation tasks.
Architecture
The DBpedia Spotlight architecture is composed of the following modules:
- Web application, a demonstration client (HTML/Javascript interface) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.
- Web Service, a RESTful Web API that exposes the functionality of annotating and/or disambiguating entities in text.
- Annotation Java/Scala API, exposing the underlying logic that performs the annotation/disambiguation.
- Indexing Java/Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.
- Evaluation module, where we test disambiguators, log results and use those to train our system to perform better.
External dependencies:
- DBpedia Extraction Framework, (only for the index module) extracting the necessary data from the Wikipedia dumps.
- Lucene 2.9.3, providing the low level indexing framework used by DBpedia Spotlight.
- LingPipe 4.0.0, providing the string matching implementation used for the Spotter module.
System Requirements
- Java 1.6+
- Scala 2.9+
- Spotlight JAR
- Spotlight Library JARs
- Lucene disambiguation index
- Spotter dictionary
- large RAM to set the heap size big enough for the Spotter (approx. 8G)
- Maven 3 for the automagic installation of dependencies.
Programmatic usage
If you want to use DBpedia Spotlight in your Java/Scala code, take a look at core/SpotlightFactory to see how you can create your objects, and then look at rest/Candidates.java to see how you can wire them together.
Online Usage
Refer to User’s manual.
Content Negotiation
You can request different types of output by setting the Accept request header. For example, in order to request JSON output, you can add Accept:application/json to the request headers.
One example using cURL:
curl "http://spotlight.dbpedia.org/rest/annotate?text=President%20Michelle%20Obama%20called%20Thursday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20" -H "Accept:application/json"
The content types we currently support are:
- text/html
- application/xhtml+xml
- text/xml
- application/json
The application/xhtml+xml output comes with embedded RDFa that you can give to the RDFa Distiller and get RDF triples in Turtle, RDF+XML, etc. as output.
If your input text is long, you may prefer using POST instead of GET.
curl -i -X POST \
Please note that you must use content-type application/x-www-form-urlencoded for POST requests.
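As a minimal sketch of the POST variant, the snippet below prepares a form-encoded request with an Accept header using only the Python standard library. The helper name `build_annotate_post` is hypothetical and chosen for illustration; the endpoint and the confidence/support parameters are the ones used in the examples of this manual. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the examples in this manual.
ANNOTATE_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_post(text, confidence=0.2, support=20,
                        accept="application/json", url=ANNOTATE_URL):
    """Prepare (but do not send) a form-encoded POST request.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    # urlencode produces an application/x-www-form-urlencoded body.
    body = urlencode({"text": text, "confidence": confidence, "support": support})
    return Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": accept,  # content negotiation, e.g. text/xml or application/json
        },
        method="POST",
    )


request = build_annotate_post("President Obama called Wednesday on Congress.")
```

Passing `request` to `urllib.request.urlopen` would send it; that step needs network access and a reachable service, so it is left out here.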
The following four examples each consist of a query URL and its result.
Example 1: without type restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20
returns the XML
<Annotation text="President Obama called Wednesday on Congress to extend a tax break
Example 2: with type restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20&types=Person,Organisation
returns the XML
<Annotation text="President Obama called Wednesday on Congress to extend a tax break
Example 3: with SPARQL restriction
http://spotlight.dbpedia.org/rest/annotate?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20&sparql=SELECT+DISTINCT+%3Fx%0D%0AWHERE+%7B%0D%0A%3Fx+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FOfficeHolder%3E+.%0D%0A%3Fx+%3Frelated+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FChicago%3E+.%0D%0A%7D
returns the XML
<Annotation text="President Obama called Wednesday on Congress to extend a tax break
Example 4: Candidates Interface
The parameters are the same as in Example 1, but you will send your request to http://spotlight.dbpedia.org/rest/candidates
http://spotlight.dbpedia.org/rest/candidates?text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20
returns XML
<annotation text="President Obama on Monday will call for a new minimum tax rate for individuals making more than $1 million a year to ensure that they pay at least the same percentage of their earnings as other taxpayers, according to administration officials. ">
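The query URLs in the four examples above can also be assembled programmatically rather than by hand. The following is a sketch under stated assumptions: `build_url` is a hypothetical helper, not part of DBpedia Spotlight, and it only covers the `text`, `confidence`, `support` and `types` parameters shown in this manual. Note that it percent-encodes the comma in `types` as `%2C`, which decodes to the same value as the literal comma used in Example 2.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Base of the public demo endpoints used throughout this manual.
BASE_URL = "http://spotlight.dbpedia.org/rest"


def build_url(endpoint, text, confidence=0.2, support=20, types=None):
    """Assemble a GET URL for the 'annotate' or 'candidates' endpoint.

    Hypothetical helper written for illustration only.
    """
    params = {"text": text, "confidence": confidence, "support": support}
    if types:
        # Type restriction as in Example 2: a comma-separated list of classes.
        params["types"] = ",".join(types)
    # quote_via=quote encodes spaces as %20, matching the URLs shown above.
    return f"{BASE_URL}/{endpoint}?{urlencode(params, quote_via=quote)}"


TEXT = "President Obama called Wednesday on Congress"
annotate_url = build_url("annotate", TEXT, types=["Person", "Organisation"])
candidates_url = build_url("candidates", TEXT)
```

Switching between Example 1 and Example 4 is then only a matter of changing the endpoint name, since the parameters are the same.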
Installation
Refer to Installation.
Web service
This page gives an introduction on how to use the DBpedia Spotlight Web Service. The available service endpoints are listed below and described in more detail in the User’s Manual.
Spotting
Spotting: takes text as input and recognizes entities/concepts to annotate. Several spotting techniques are available, such as dictionary lookup and Named Entity Recognition (NER).
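To make the idea of dictionary-lookup spotting concrete, here is a deliberately naive sketch: it scans the text word by word and keeps the longest phrase that appears in a set of known surface forms. This is not how DBpedia Spotlight implements its spotter; the function `spot`, the `max_words` limit and the tiny lexicon are assumptions made purely for illustration.

```python
import re


def spot(text, surface_forms, max_words=3):
    """Return (offset, phrase) pairs for the longest dictionary matches.

    Illustrative greedy longest-match lookup, not Spotlight's implementation.
    """
    lexicon = {form.lower() for form in surface_forms}
    # Each token is (start offset, word).
    tokens = [(m.start(), m.group()) for m in re.finditer(r"\w+", text)]
    spots = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest window first, then shrink it.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start = tokens[i][0]
            last_start, last_word = tokens[i + n - 1]
            end = last_start + len(last_word)
            if text[start:end].lower() in lexicon:
                spots.append((start, text[start:end]))
                matched = n
                break
        # Skip past the matched phrase, or advance one word if nothing matched.
        i += matched or 1
    return spots


spots = spot(
    "President Obama called Wednesday on Congress.",
    ["Obama", "President Obama", "Congress"],
)
```

Because longer phrases are tried first, “President Obama” is preferred over the shorter surface form “Obama” at the same position.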
Disambiguate
Disambiguation: takes spotted text input, where entities/concepts have already been recognized and marked as wiki markup or xml. Chooses an identifier for each recognized entity/concept given the context.
Supported types (POST/GET): XML, JSON, HTML, RDFa, NIF
Annotate
Annotation: runs spotting and disambiguation. Takes text as input, recognizes entities/concepts to annotate and chooses an identifier for each recognized entity/concept given the context.
Supported types (POST/GET): XML, JSON, HTML, RDFa, NIF
Candidates
Similar to annotate, but returns a ranked list of candidates instead of deciding on one. Each candidate in the list carries the properties described below:
- support: how prominent this entity is, i.e. the number of inlinks in Wikipedia;
- priorScore: normalized support;
- contextualScore: score from comparing the context representation of an entity with the text (e.g. cosine similarity with tf-icf weights);
- percentageOfSecondRank: measures by how much the winning entity has won, computed as contextualScore_2ndRank / contextualScore_1stRank; the lower this score, the further the first-ranked entity was “in the lead”;
- finalScore: a combination of all of the above.
Supported types (POST/GET): XML, JSON
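The percentageOfSecondRank ratio described above can be illustrated with a few lines of Python. This is a sketch of the formula only (second-ranked contextual score divided by first-ranked contextual score); the function name and the handling of a single candidate or a zero top score are assumptions, not documented behaviour of the service.

```python
def percentage_of_second_rank(contextual_scores):
    """Ratio of the second-best to the best contextual score.

    Values near 0 mean the winner was far ahead; values near 1 mean
    the top two candidates were almost tied.
    """
    ranked = sorted(contextual_scores, reverse=True)
    # Assumption for this sketch: with no runner-up (or a zero top score)
    # we report 0.0, i.e. the winner is treated as unchallenged.
    if len(ranked) < 2 or ranked[0] == 0:
        return 0.0
    return ranked[1] / ranked[0]


clear_win = percentage_of_second_rank([0.2, 0.8, 0.1])
near_tie = percentage_of_second_rank([0.4, 0.4])
```

In the first call the best candidate scores 0.8 and the runner-up 0.2, so the ratio is low and the decision is comparatively safe; in the second call the two candidates tie and the ratio is 1.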
Examples
Example 1: Simple request
- text= “President Obama called Wednesday on Congress to extend a tax break for students included in last year’s economic stimulus package, arguing that the policy provides more generous assistance.”
- confidence = 0.2; support=20
- whitelist all types.
curl http://spotlight.dbpedia.org/rest/annotate \
Example 2: Using SPARQL for filtering
This example demonstrates how to keep the annotations constrained to only politicians related to Chicago.
- text= “President Obama called Wednesday on Congress to extend a tax break for students included in last year’s economic stimulus package, arguing that the policy provides more generous assistance.”
- confidence = 0.2; support=20
- whitelist sparql = SELECT DISTINCT ?politician WHERE { ?politician a <http://dbpedia.org/ontology/OfficeHolder> . ?politician ?related <http://dbpedia.org/resource/Chicago> }
curl http://spotlight.dbpedia.org/rest/annotate \
Notice: Due to system resource restrictions, this demo only uses the first 2000 results returned for each query (the default for the public DBpedia SPARQL endpoint). However, you are welcome to download the software and data and install them on your own server for real-world use cases.
Attention: Make sure to URL-encode your SPARQL query before adding it as the value of the &sparql parameter; see java.net.URLEncoder.encode().
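For readers not working in Java, the same form-style encoding is available in Python's standard library as `urllib.parse.quote_plus`. The sketch below encodes the politician query from this example; the variable names are illustrative only.

```python
from urllib.parse import quote_plus, unquote_plus

# The SPARQL filter from Example 2, written on one line.
SPARQL = (
    "SELECT DISTINCT ?politician WHERE { "
    "?politician a <http://dbpedia.org/ontology/OfficeHolder> . "
    "?politician ?related <http://dbpedia.org/resource/Chicago> }"
)

# quote_plus turns spaces into '+' and percent-encodes reserved characters
# such as '?', '{', '<', ':' and '/'.
encoded = quote_plus(SPARQL)

# The encoded value can then be appended as the sparql parameter.
query_suffix = "&sparql=" + encoded
```

Decoding with `unquote_plus` returns the original query, which is a quick way to check that nothing was lost in encoding.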
Content Negotiation
You can request different types of output by setting the Accept request header. For example, in order to request JSON output, you can add Accept:application/json to the request headers.
One example using cURL:
curl "http://spotlight.dbpedia.org/rest/annotate?text=President%20Michelle%20Obama%20called%20Thursday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package,%20arguing%20that%20the%20policy%20provides%20more%20generous%20assistance.&confidence=0.2&support=20" -H "Accept:application/json"
The content types we currently support are:
- text/html
- application/xhtml+xml
- text/xml
- application/json
The application/xhtml+xml comes with embedded RDFa that you can give to the RDFa Distiller and get RDF triples in Turtle, RDF+XML, etc. as output.
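Once JSON output has been requested, the annotations can be read with any JSON parser. The sketch below assumes a response shape with a `Resources` list whose entries carry `@URI`, `@surfaceForm` and `@offset` keys; this layout and the sample payload are assumptions for illustration and should be checked against the actual output of your installation.

```python
import json

# Hypothetical sample payload; the key names are assumptions for this sketch.
SAMPLE_RESPONSE = """
{
  "@text": "President Obama called Wednesday on Congress.",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/Barack_Obama",
     "@surfaceForm": "Obama", "@offset": "10"},
    {"@URI": "http://dbpedia.org/resource/United_States_Congress",
     "@surfaceForm": "Congress", "@offset": "36"}
  ]
}
"""


def extract_links(payload):
    """Return (surface form, offset, URI) triples from a JSON response string."""
    data = json.loads(payload)
    # Tolerate a missing list, in case a response contains no annotations.
    resources = data.get("Resources", [])
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"]) for r in resources]


links = extract_links(SAMPLE_RESPONSE)
```

Each triple ties a phrase and its character offset in the input text to the DBpedia identifier chosen for it.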
If your input text is long, you may prefer using POST instead of GET.
curl -i -X POST \
Please note that you must use content-type application/x-www-form-urlencoded for POST requests.
Run from a JAR
This page describes how to run DBpedia Spotlight on your own server by using a pre-packaged JAR. We assume that you are running these commands on a bash command line (Linux) and have wget, curl and java installed.
Requirements
- Java 1.6+
- RAM of appropriate size for the spotter lexicon you need
Quickstart
The commands below will help you to obtain a pre-packaged lightweight deployment to get you started.
Lucene:
wget http://spotlight.dbpedia.org/download/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip
Older jars are downloadable from: https://github.com/dbpedia-spotlight/dbpedia-spotlight/downloads
Statistical:
wget http://spotlight.sztaki.hu/downloads/version-0.1/en.tar.gz
Test your installation
In order to test your new installation, run:
curl http://localhost:2222/rest/annotate \
To learn more about how to call your newly installed Web Service and which parameters it accepts, see the Web service section.
Upgrade your models
Lucene:
The files you’ve downloaded above contain only a very small subset of the DBpedia resources. They are used to demonstrate DBpedia Spotlight in a lightweight environment. Please see our Downloads for more information on other alternatives that are more useful in real world scenarios. See below one example.
First rename your small model files:
mv data/index data/index-small
Now obtain new copies with larger models:
cd data
If you are using the largest spotter dictionary, you may need to increase the Java heap space, e.g. by adding -Xmx10G to your command line.
Statistical:
We offer only the complete model with this option. You can download the newest models from http://spotlight.sztaki.hu/downloads/
Two backend versions
Statistical backend
Refer to Statistical backend.
Lucene backend
Refer to Lucene backend.